
[Feat] Convert GeoJson Features to GeoArrow (Draft) #2889

Draft · wants to merge 3 commits into master
Conversation

lixun910
Collaborator

@lixun910 lixun910 commented Feb 20, 2024

This WIP PR adds functions to convert GeoJSON Features to GeoArrow. These functions could be used to:

  • support writing the GeoArrow file format
  • support an arrow-table output option in other loaders, e.g. GeoJSON/CSV, to load spatial data into Arrow memory
  • support creating GeoArrow tables in memory

I am not confident in the implementation in this PR (documentation for the JS apache-arrow builders is missing). Please help take a look if you are interested. Thank you!

@lixun910 lixun910 marked this pull request as draft February 20, 2024 21:52
@lixun910 lixun910 requested a review from kylebarron February 20, 2024 22:02
Collaborator

@kylebarron kylebarron left a comment


Do you plan to handle converting properties to arrow as well? That requires handling at least a few different property types.
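As context for this comment, handling properties means inferring an Arrow type per property key. A minimal sketch (not part of this PR; the helper name and the coarse type names are hypothetical, plain TypeScript with no apache-arrow dependency) of scanning feature properties to pick a column type:

```typescript
// Coarse Arrow-like logical types; a real implementation would map these
// to apache-arrow DataType instances.
type PropertyType = 'float64' | 'bool' | 'utf8' | 'null';

function inferPropertyTypes(
  featureProps: Array<Record<string, unknown>>
): Map<string, PropertyType> {
  const types = new Map<string, PropertyType>();
  for (const props of featureProps) {
    for (const [key, value] of Object.entries(props)) {
      let t: PropertyType = 'null';
      if (typeof value === 'number') t = 'float64';
      else if (typeof value === 'boolean') t = 'bool';
      else if (typeof value === 'string') t = 'utf8';
      const prev = types.get(key);
      if (prev === undefined || prev === 'null') {
        types.set(key, t);
      } else if (t !== 'null' && prev !== t) {
        // Mixed types across features: fall back to strings
        types.set(key, 'utf8');
      }
    }
  }
  return types;
}
```

Null-only columns stay `'null'` until a concrete value is seen, which is one simple way to tolerate sparse GeoJSON properties.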

  Vector as ArrowVector,
  makeBuilder
} from 'apache-arrow';
import {
Collaborator

Maybe clearer to use import type if (I think) these are only used as types.

Comment on lines 36 to 46
let geometryType = '';
for (let i = 0; i < features.length; i++) {
  const {geometry} = features[i];
  if (i === 0) {
    geometryType = geometry.type;
  }
  if (geometry.type.includes('Multi')) {
    geometryType = geometry.type;
    break;
  }
}
Collaborator

Instead I'd suggest building a Set of geometry types instead, because you probably want to error if you see both Point and LineString geometries in a single feature collection (at least until you want to support mixed-geometry arrays, which have a decent amount of complexity).
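The suggestion above could be sketched as follows (a hypothetical helper, not PR code; it also treats Multi* variants as one family, matching the promotion logic in the loop above, which is an assumption about the intended semantics):

```typescript
type MinimalFeature = {geometry: {type: string}};

// Collect the distinct geometry families and fail fast on mixed input,
// instead of silently keeping the last type seen.
function getUniformGeometryType(features: MinimalFeature[]): string {
  const types = new Set<string>();
  for (const feature of features) {
    // Point and MultiPoint (etc.) count as one family for this check
    types.add(feature.geometry.type.replace(/^Multi/, ''));
  }
  if (types.size > 1) {
    throw new Error(`Mixed geometry types are not supported: ${[...types].join(', ')}`);
  }
  return [...types][0];
}
```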

Comment on lines +86 to +90
// make geoarrow builder
const builder = makeBuilder({
  type: pointType,
  nullValues: [null]
});
Collaborator

Fwiw I've never tried the Arrow JS builders.

Similar to how the geojson-to-binary conversion is implemented in loaders.gl, in my own work I try to use a two-pass approach: first count the number of coordinates and thus infer the sizes of the buffers needed, then in a second step allocate the exact-size buffers once and fill them with coordinates.

In your case, if you wanted to go that route, you could count the number of coordinates you'll have, instantiate a single Float64Array and set the values with coordinates. For nested types you'd have to manage building your own offset buffers as well, which may be a bit more complex than you want.
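The two-pass approach described above, for 2-D LineStrings only, might look roughly like this (a plain-TypeScript sketch with hypothetical names, producing the raw coordinate and offset buffers a geoarrow.linestring column is built from; not code from this PR):

```typescript
type LineStringCoords = number[][]; // one LineString: [[x, y], ...]

function linesToFlatBuffers(lines: LineStringCoords[]): {
  coords: Float64Array;
  geomOffsets: Int32Array;
} {
  // Pass 1: count total coordinates so buffers are allocated exactly once
  let numCoords = 0;
  for (const line of lines) numCoords += line.length;

  // Pass 2: fill interleaved xy coordinates plus per-geometry offsets
  const coords = new Float64Array(numCoords * 2);
  const geomOffsets = new Int32Array(lines.length + 1);
  let coordIndex = 0;
  for (let i = 0; i < lines.length; i++) {
    for (const [x, y] of lines[i]) {
      coords[coordIndex * 2] = x;
      coords[coordIndex * 2 + 1] = y;
      coordIndex++;
    }
    geomOffsets[i + 1] = coordIndex;
  }
  return {coords, geomOffsets};
}
```

The `geomOffsets` array is the list-offsets buffer: geometry i owns coordinates `geomOffsets[i]` up to (exclusive) `geomOffsets[i + 1]`.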

Collaborator

While your two-pass approach is surely more optimized, it would be nice to build a baseline with the builders, as this presumably lets us work more sanely with complex data types that can have many index columns.

We can add such optimizations later.

Comment on lines +99 to +100
// build geometry vector
const geometry = builder.finish().toVector();
Collaborator

If you ever want to handle attributes in the future, ideally each column will have identical chunking. Whereas when you let Arrow JS handle the builders, you don't necessarily have control over the chunking of each column.

Collaborator

Good point. I was assuming that if we build the columns in parallel, we can choose to stop "building" after a certain number of rows, create a RecordBatch from the "partial" vectors for each column, and then start building new columns/vectors again, rinse and repeat?
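The batching idea above relies on every column being split at identical row boundaries. A generic sketch of that splitting step (illustrative only; the real version would feed each chunk into per-column builders and assemble a RecordBatch):

```typescript
// Split rows into fixed-size chunks so that all columns built from the same
// chunk list share identical batch boundaries.
function chunkRows<T>(rows: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let start = 0; start < rows.length; start += batchSize) {
    batches.push(rows.slice(start, start + batchSize));
  }
  return batches;
}
```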

Comment on lines 154 to 190
export function geojsonLineStringToArrow(name: string, lines: Feature[]): GeoArrowReturnType {
  // get dimension from the first line
  const dimension = (lines[0].geometry as LineString).coordinates[0].length;
  const pointFieldName = dimension === 2 ? 'xy' : 'xyz';

  // get line type
  const nullable = false;
  const coordType = new ArrowFloat();
  const pointType = new FixedSizeList(
    dimension,
    new ArrowField(pointFieldName, coordType, nullable)
  );
  const lineType = new ArrowList(new ArrowField('points', pointType, nullable));

  // create a field
  const metaData: Map<string, string> = new Map([['ARROW:extension:name', 'geoarrow.linestring']]);
  const field = new ArrowField(name, lineType, nullable, metaData);

  // make geoarrow builder
  const builder = makeBuilder({
    type: lineType,
    nullValues: [null]
  });

  for (let i = 0; i < lines.length; i++) {
    const coords = (lines[i].geometry as LineString).coordinates;
    // @ts-ignore
    builder.append(coords);
  }

  const geometry = builder.finish().toVector();

  return {
    field,
    geometry
  };
}
Collaborator

It seems like each of these 6 functions is exactly the same except for the creation of the types. You might be able to deduplicate the second half (the filling) of each of these functions.
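The shared skeleton suggested above could be factored roughly like this (a hypothetical sketch; `makeType` and `getCoords` are illustrative callback names, and the apache-arrow builder call is reduced to collecting values so the shape is visible without the library):

```typescript
// One generic routine; each geometry kind only supplies its type construction
// and its coordinate extraction.
function buildGeometryColumn<TGeom, TCoords>(
  geometries: TGeom[],
  makeType: () => unknown,            // e.g. returns an Arrow List/FixedSizeList type
  getCoords: (geom: TGeom) => TCoords // e.g. (f) => (f.geometry as LineString).coordinates
): {type: unknown; values: TCoords[]} {
  const type = makeType();
  const values: TCoords[] = [];
  for (const geom of geometries) {
    // with apache-arrow this would be builder.append(getCoords(geom))
    values.push(getCoords(geom));
  }
  return {type, values};
}
```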

@lixun910
Collaborator Author

Thank you very much, @kylebarron! Great comments; I will address them. Yes, this PR is initial work on converting other formats, e.g. GeoJSON/CSV, to Arrow, so I plan to handle attributes in the future. Do you know if there is already a pure JS library doing this?

Another question: when you mention in my own work I try to do a two-pass approach where I first count the number of coordinates and thus infer the sizes of buffers needed, could you point me to the code? The implementation in this PR is borrowed from OGR C++ code, which is based on the Apache Arrow C lib. I can't find documentation on how to use makeBuilder() or how to create a nested list array with offset buffers...

Thanks again!

@kylebarron
Collaborator

https://github.com/geoarrow/geoarrow-js/blob/main/src/algorithm/coords.ts has an example of creating arrays from raw Float64Arrays and Int32Arrays for offsets (though offsets there are not modified). You can look at the overloads on makeData for what input each type expects.
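For nested types, "managing your own offset buffers" means one offsets array per nesting level. A plain-TypeScript sketch for MultiLineString geometries (hypothetical helper; the resulting Int32Arrays are the kind of raw buffers a makeData-style construction would consume):

```typescript
type MultiLineCoords = number[][][]; // per geometry: [line: [[x, y], ...], ...]

// Build two levels of list offsets: geometries -> lines, lines -> coordinates.
function multiLineOffsets(geoms: MultiLineCoords[]): {
  geomOffsets: Int32Array;
  lineOffsets: Int32Array;
} {
  let numLines = 0;
  for (const geom of geoms) numLines += geom.length;

  const geomOffsets = new Int32Array(geoms.length + 1);
  const lineOffsets = new Int32Array(numLines + 1);
  let lineIndex = 0;
  let coordIndex = 0;
  for (let i = 0; i < geoms.length; i++) {
    for (const line of geoms[i]) {
      coordIndex += line.length;
      lineOffsets[++lineIndex] = coordIndex;
    }
    geomOffsets[i + 1] = lineIndex;
  }
  return {geomOffsets, lineOffsets};
}
```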

Signed-off-by: Xun Li <[email protected]>
Collaborator

@ibgreen ibgreen left a comment


This looks like a great start. I'd love to use it for adding arrow output from GeoJSONLoader and CSVLoader etc. For that we also need to be able to create at least a subset of normal / property columns (numbers, booleans, strings, maybe dates).

@@ -0,0 +1,375 @@
import {
  Float64 as ArrowFloat,
Collaborator

Since this renaming tends to get excessive, in other files I have been breaking my own rule (not to use wildcard imports) and used import * as arrow and then referenced as arrow.Float. I think this makes the code quite clear.

/**
* Arrow Field and geometry vector
*/
export type GeoArrowReturnType = {
Collaborator

This seems more general; overall I would like us to have support for all column types, not just GeoArrow geometry:

Suggested change
export type GeoArrowReturnType = {
export type ArrowColumn = {

    break;
  }
}
if (geometryType === 'Point') {
Collaborator

Nit: I find switch statements much easier to read.

if (i === 0) {
  geometryType = geometry.type;
}
if (geometry.type.includes('Multi')) {
Collaborator

Maybe more readable?

Suggested change
if (geometry.type.includes('Multi')) {
if (geometry.type.startsWith('Multi')) {

  name: string,
  points: Array<Feature | GeoJsonObject>
): GeoArrowReturnType {
const firstPoint = isFeature(points[0]) ? points[0].geometry : points[0];
Collaborator

What does this do? Handle lists of just geometries? Do we need such dynamic support? It makes the code a little harder to follow.

